feat(i18n): add Traditional + Simplified Chinese entity detection by lmanchu · Pull Request #945 · MemPalace/mempalace

lmanchu · 2026-04-16T09:43:42Z

Problem

zh-TW and zh-CN are shipped in mempalace/i18n/ but have no entity section. When a Chinese user runs:

detect_entities(paths, languages=("zh-TW",))

get_entity_patterns() silently falls back to English (i18n/__init__.py:231-233), so the English candidate pattern [A-Z][a-z]{1,19} is applied to Chinese text. Result: no Chinese names are extracted; only Latin-script names embedded in the Chinese document are found. ja and ko share the same bug (to be fixed in follow-up PRs).
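To see why the fallback finds nothing, the English candidate pattern can be run directly against Chinese text. This is a standalone sketch using only the standard `re` module; it does not call mempalace, and the variable names are illustrative.

```python
import re

# The English candidate pattern quoted in the description above.
EN_CANDIDATE = re.compile(r"[A-Z][a-z]{1,19}")

zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"

# Han characters have no upper/lower case, so only capitalised
# Latin-script tokens can ever satisfy [A-Z][a-z]{1,19}.
latin_hits = EN_CANDIDATE.findall(zh_text)
print(latin_hits)
```

The three mentions of 朱宜振 are invisible to the pattern; only the embedded Latin-script name can match.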

Reproduction (before this PR)

from mempalace.entity_detector import extract_candidates

zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"
extract_candidates(zh_text, languages=("zh-TW",))
# → {}                    ← no Chinese names
extract_candidates(zh_text, languages=("zh-TW", "en"))
# → {"Jeffrey": 1}        ← only English name, misses 朱宜振 entirely

Approach

Add entity sections to zh-TW.json and zh-CN.json that work within the current framework's constraints:

candidate_pattern: CJK n-grams that start with a common surname (~100 surnames, covering >95% of Taiwanese and PRC names). The surname is followed by {1,2} characters, so greedy matching cannot swallow a trailing verb (e.g. extracting 朱宜振說 as the entity would be wrong).
boundary_chars: \u4E00-\u9FFF. This reuses the script-aware \b infrastructure from #932 (fix(entity_detector): script-aware word boundaries for combining-mark scripts). Applied to CJK, \b fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
person_verb_patterns: Chinese verbs attach directly to the name with no whitespace, so patterns are written as {name}說, {name}問, {name}決定 — no \b or \s+ between them.
dialogue_patterns: full-width colon ：, Chinese quotes 「」『』, plus the standard Latin forms.
pronoun_patterns: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
stopwords: ~140 entries — particles, pronouns, time expressions, question words, conjunctions, UI nouns, politeness forms.
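The candidate pattern and the verb patterns above can be sketched with plain `re`. This is a minimal illustration under stated assumptions, not the shipped JSON: the six-surname list is a hypothetical subset of the ~100, and `BOUNDARY` is an assumed stand-in for how the loader turns \b into a CJK↔non-CJK transition (the real expansion from #932 may differ).

```python
import re

CJK = r"\u4E00-\u9FFF"          # boundary_chars range from the PR
SURNAMES = "朱王李陳林張"         # hypothetical subset of the surname list

# Assumed emulation of the script-aware \b: fires only where exactly one
# neighbour is a CJK character (string start/end count as non-CJK).
BOUNDARY = rf"(?:(?<=[{CJK}])(?![{CJK}])|(?<![{CJK}])(?=[{CJK}]))"


def candidate_pattern(max_trailing: int) -> re.Pattern:
    """Surname + 1..max_trailing CJK characters, wrapped in boundaries."""
    return re.compile(
        f"{BOUNDARY}[{SURNAMES}][{CJK}]{{1,{max_trailing}}}{BOUNDARY}"
    )


text = "朱宜振 主持。朱宜振說 可以。"

# Cap of 2 (as in the PR): the verb 說 cannot be absorbed into the name.
capped = candidate_pattern(2).findall(text)
# Cap of 3 (for contrast): greedy matching swallows the trailing verb.
uncapped = candidate_pattern(3).findall(text)

# Verb templates attach directly to the name, with no \b or \s+ in between.
VERB_TEMPLATES = ["{name}說", "{name}問", "{name}決定"]


def verb_hits(name: str, doc: str) -> int:
    """Count person-verb signals for an already extracted name."""
    return sum(
        len(re.findall(t.replace("{name}", re.escape(name)), doc))
        for t in VERB_TEMPLATES
    )


signals = verb_hits("朱宜振", "朱宜振說 可以。朱宜振決定 ship。")
print(capped, uncapped, signals)
```

With the cap at 2 the second mention (朱宜振說) yields no candidate at all rather than a wrong one; the verb templates then pick up 說 and 決定 as person signals once the name is known from another mention.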

What you get

# After this PR
zh_text = (
    "# 會議紀錄\n"
    "- 朱宜振 主持\n"
    "- Jeffrey Lai 報告融資\n"
    "朱宜振 跟 Jeffrey 討論 pitch。\n"
    "朱宜振: 「我們要 6 月 launch。」\n"
    "朱宜振 同意 Arnold 的方案。\n"
    "朱宜振 決定 ship pitch。\n"
    # ...8 more mentions...
)
detect_entities(..., languages=("zh-TW", "en"))
# people:    [('朱宜振', 0.99)]       ← correctly classified as person
# uncertain: [('Jeffrey Lai', 0.06), ...]

Known Limitation (documented in tests)

CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted — the framework's \b(...)\b wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (test_zh_tw_known_limitation_inline_name_no_boundary).

In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: 朱宜振 appearing in 8 sentences was extracted 11x with 0.99 person-classification confidence.
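The limitation and its practical impact can be reproduced outside the framework. The sketch below assumes the same hypothetical boundary emulation (a lookaround that fires only at CJK↔non-CJK transitions) and a one-surname list; it is not the code path exercised by test_zh_tw_known_limitation_inline_name_no_boundary.

```python
import re

CJK = r"\u4E00-\u9FFF"
# Assumed emulation of the script-aware \b (the real expansion may differ).
BOUNDARY = rf"(?:(?<=[{CJK}])(?![{CJK}])|(?<![{CJK}])(?=[{CJK}]))"
# Hypothetical single-surname candidate pattern, capped at 2 trailing chars.
CANDIDATE = re.compile(f"{BOUNDARY}朱[{CJK}]{{1,2}}{BOUNDARY}")

# Adversarial case: the name has CJK neighbours on both sides.
inline = CANDIDATE.findall("會議由朱宜振主持")

# Realistic note: the same name also appears next to a bullet, whitespace,
# a newline, and a full-width colon.
note = "- 朱宜振 主持\n會議由朱宜振主持。\n朱宜振：「好。」"
matches = CANDIDATE.findall(note)
print(inline, matches)
```

Two of the three mentions in the note land on a matchable boundary, which is the effect the paragraph above relies on for names that recur across a document.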

Testing

7 new tests in tests/test_entity_detector.py:
- test_zh_tw_candidate_extraction_at_boundaries
- test_zh_tw_person_classification
- test_zh_tw_stopwords_filter_common_particles
- test_zh_tw_falls_back_to_english_for_non_cjk_names
- test_zh_cn_candidate_extraction
- test_zh_cn_and_zh_tw_union_covers_both_variants
- test_zh_tw_known_limitation_inline_name_no_boundary
Full suite: 957 passed, 0 failed (pytest tests/ -q).
Ruff clean (ruff check mempalace/i18n/ tests/test_entity_detector.py).

Follow-ups (separate PRs)

ja.json: same treatment (currently falls back to English).
ko.json: same treatment.

Checklist

zh-TW and zh-CN previously had no `entity` section. Calling `detect_entities(..., languages=("zh-TW",))` silently fell back to English patterns (i18n/__init__.py:231-233), so no Chinese names were ever extracted: Chinese-speaking users got zero people or projects detected from their own notes.

This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100 surnames covering >95% of Taiwanese / PRC names), length capped at {1,2} trailing chars so greedy matches don't swallow the trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK transitions. This is the same mechanism used for Devanagari, applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the name with no whitespace, so patterns are written as `{name}說`, `{name}問`, `{name}決定`, with no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes 「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions, question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts have no word delimiters, so a name flanked by CJK on both sides with no punctuation or whitespace break is not extracted. This is a fundamental limit of regex-based CJK entity detection; resolving it would require a dictionary tokeniser. Realistic Chinese technical writing contains enough non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines) that 3+ occurrences normally produce matches. Verified against a realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`, both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.

igorls · 2026-04-18T05:07:42Z

I think the new ASCII command-style project patterns in zh-TW.json / zh-CN.json are being neutralized by the script-boundary expansion.

Specifically, patterns like \bimport\s+{name}\b and \bpip\s+install\s+{name}\b go through _expand_b(...) whenever boundary_chars is set for the locale. For Chinese, that rewrites \b into a CJK-transition boundary rather than normal Python word-boundary semantics.

As a result, the expanded regex no longer matches plain ASCII text at all. So these project signals are effectively dead code in the current implementation.

I’d suggest either:

- removing \b from those two patterns for the Chinese locales, or
- avoiding boundary expansion for explicitly ASCII-oriented patterns like import/pip commands.

Everything else in the PR looked consistent to me, but I don’t think these two patterns currently do what the PR intends.
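The effect described in this comment can be illustrated in isolation. The sketch below is an assumption about the mechanism, not the project's code: `expand_b` is a hypothetical stand-in for `_expand_b(...)` that rewrites every \b into a CJK↔non-CJK transition, and the real helper may be implemented differently.

```python
import re

CJK = r"\u4E00-\u9FFF"
# Assumed replacement for \b when boundary_chars is set for the locale.
CJK_BOUNDARY = rf"(?:(?<=[{CJK}])(?![{CJK}])|(?<![{CJK}])(?=[{CJK}]))"


def expand_b(pattern: str) -> str:
    # Hypothetical stand-in for _expand_b: swap each \b for the CJK boundary.
    return pattern.replace(r"\b", CJK_BOUNDARY)


def build(template: str, name: str) -> str:
    # Fill the {name} slot without str.format, which would trip on regex braces.
    return template.replace("{name}", re.escape(name))


template = r"\bimport\s+{name}\b"
stock = re.compile(build(template, "numpy"))
expanded = re.compile(expand_b(build(template, "numpy")))
# First suggested fix: drop \b from the ASCII-oriented pattern.
without_b = re.compile(expand_b(build(r"import\s+{name}", "numpy")))

ascii_line = "import numpy"
mixed_line = "請先import numpy再執行"

stock_hit = stock.search(ascii_line) is not None           # plain \b semantics
expanded_hit = expanded.search(ascii_line) is not None     # no CJK neighbour
expanded_mixed_hit = expanded.search(mixed_line) is not None
without_b_hit = without_b.search(ascii_line) is not None
print(stock_hit, expanded_hit, expanded_mixed_hit, without_b_hit)
```

Under this assumption the expanded pattern only matches when the command is directly flanked by CJK characters, which plain ASCII code lines never are; removing \b from the two patterns restores matching on plain ASCII text.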

Restore-integrity release. Unbreaks fresh `pip install mempalace` from v3.3.2 by re-tagging current develop, which carries both the plugin.json consumer (shipped in 3.3.2) and the matching mempalace-mcp entry point in pyproject.toml (added on develop ~10h after the 3.3.2 tag via MemPalace#340 by @messelink). MemPalace#1093 diagnosed by @jphein.

Bumps (all 5 sources agree per Version Guard / CLAUDE.md):
- mempalace/version.py 3.3.2 → 3.3.3
- pyproject.toml 3.3.2 → 3.3.3
- .claude-plugin/plugin.json 3.3.2 → 3.3.3
- .claude-plugin/marketplace.json 3.3.2 → 3.3.3
- .codex-plugin/plugin.json 3.3.2 → 3.3.3
- CHANGELOG.md new [3.3.3] entry

No code changes. The fix for MemPalace#1093 is already on develop via merged PRs MemPalace#340, MemPalace#1021, MemPalace#851, MemPalace#942, MemPalace#833, MemPalace#673, MemPalace#661, MemPalace#659, MemPalace#1097, MemPalace#1051, MemPalace#1001, MemPalace#945.

Branch name intentionally outside the `release/*` ruleset so follow-up CI-fix commits aren't gated behind a nested PR. (Supersedes MemPalace#1143, closed for exactly that reason after it missed 3 of 5 version files.)

Smoke-tested locally from a fresh develop clone:
    grep mempalace-mcp pyproject.toml .claude-plugin/plugin.json   # both ✓
    python -m build --wheel                                        # ✓
    pip install …-py3-none-any.whl                                 # ✓
    which mempalace-mcp                                            # ✓
    mempalace-mcp --help                                           # ✓

lmanchu requested review from bensig, igorls and milla-jovovich as code owners April 16, 2026 09:43

style: fix ruff format for test_entity_detector.py

c88b8a2

Collapse implicit string concatenation to single-line strings to satisfy ruff format --check in CI. Co-Authored-By: Claude <noreply@anthropic.com>

mvalentsev mentioned this pull request Apr 18, 2026

feat(i18n): add entity detection to German, Spanish, and French locales #1001

Merged

6 tasks

igorls merged commit 2a5914b into MemPalace:develop Apr 21, 2026
6 checks passed

igorls mentioned this pull request Apr 21, 2026

feat(searcher): wire i18n stop words into BM25 tokenizer (#973) #977

Open

This was referenced Apr 23, 2026

release: v3.3.3 — restore install integrity #1143

Closed

release: v3.3.3 — restore install integrity #1144

Merged

bensig mentioned this pull request Apr 24, 2026

release: v3.3.3 — sync develop → main for tag cut #1159

Merged

igorls merged 2 commits into MemPalace:develop from lmanchu:feat/zh-entity-detection
